Bulletin of Electrical Engineering and Informatics 
Vol. 11, No. 6, December 2022, pp. 3570~3576 
ISSN: 2302-9285, DOI: 10.11591/eei.v11i6.4353


K-Means clustering-based semi-supervised for DDoS attacks 
classification 


Mahdi Nsaif Jasim1, Methaq Talib Gaata2


1Department of Business Informatics, University of Information Technology and Communications, Baghdad, Iraq
2Department of Computer Science, Mustansiriyah University, Baghdad, Iraq


Article Info

Article history:
Received Jul 4, 2022
Revised Aug 5, 2022
Accepted Aug 16, 2022

Keywords:
CICIDS 2017
Clustering
Distributed denial of service
Feature selection
K-Means algorithm
Network security

ABSTRACT

Network attacks of the distributed denial of service (DDoS) form are used to
disrupt server replies and services. This form of attack is popular because it is
easy to set up and challenging to detect, and it is considered to be among the
most dangerous internet threats. DDoS attacks on network traffic can be
identified in a variety of ways; however, the most effective methods for detecting
and identifying a DDoS attack are machine learning approaches. Supervised
machine learning algorithms require labeled network traffic datasets, whereas
unsupervised methods use network traffic analysis to find attacks. In this
research, the K-Means clustering algorithm was developed as a semi-supervised
approach for DDoS classification. The proposed algorithm is trained and tested
with the CICIDS2017 dataset. After applying the proposed hybrid feature
selection methods and repeatedly training, testing, and carefully sorting DDoS
traffic through a series of experiments, the optimum 2 centroids were found to
be DDoS and normal. The generated centroids can be used to classify network
traffic. The proposed method thus succeeded in clustering the network traffic
into safe and threat.

This is an open access article under the CC BY-SA license.

Corresponding Author: 
Mahdi Nsaif Jasim 


Department of Business Informatics, University of Information Technology and Communications 
Baghdad, Iraq 
Email: mahdinsaif@uoitc.edu.iq


1. INTRODUCTION 

A distributed denial of service (DDoS) attack is a form of denial of service (DoS) attack in which
the attacker targets the victim by utilizing the IP address of an authorized user. DDoS attacks include
SYN flood, ACK flood, UDP flood, connection DDoS, DNS reflection, and ICMP flood, among others
[1]. The primary goal of an attack is to prevent legitimate users from making use of the victim's services by
overloading its resources. One tactic attackers use to accomplish this is to send a barrage of fake requests
through the network. A DDoS attack is launched from multiple computers simultaneously. By overwhelming
the infrastructure that surrounds the internet traffic flow, a DoS attack interferes with the regular traffic and
networking operations of a targeted server. The rate and volume of network traffic sent
to the target closely correlate with the attack's severity [2].

Since the 1990s, sophisticated intrusion detection systems have been made with the help of data 
mining. Data mining techniques in general, and machine learning techniques in particular, must be applied in 
five steps: selection, preprocessing, transformation, mining, and interpretation [3], [4]. Out of all the ways to 
find intrusions using data mining, these three important steps are the hardest. There are three types of machine 
learning-based DDoS detection methods that are already in use. The first group consists of supervised ML
approaches, which build the detection model from datasets of network traffic that have been generated and
labeled. Supervised approaches have to deal with two big problems. First, making labeled network traffic
datasets takes a lot of time and computing power, and without constant model updates supervised machine
learning techniques cannot predict novel behavior, whether safe or risky. Second, supervised ML classifiers
do not work as well when the network traffic contains a lot of abnormal data, which is called noise. The
second group consists of unsupervised approaches, which, unlike the first group, need no labeled dataset to
build the detection model. The main problem with unsupervised methods is that they produce many false
positives. The curse of dimensionality [5] also makes it hard for unsupervised methods to find attacks
accurately [6]. The third group consists of semi-supervised approaches. By being able to work on both
labeled and unlabeled datasets, semi-supervised ML takes advantage of both supervised and unsupervised
techniques, and using the two together can improve accuracy and reduce the number of false positives.
However, the problems of both approaches also make it hard for semi-supervised approaches to work, so
their components need to be combined carefully to compensate for the weaknesses of supervised and
unsupervised approaches.

Semi-supervised learning is a group of machine learning tasks and techniques that combine
labeled and unlabeled samples for training, typically a small number of labeled samples with a
large number of unlabeled ones. It lies between supervised and unsupervised learning.
Numerous machine learning researchers have demonstrated that integrating small
amounts of labeled data with unlabeled data can dramatically improve learning accuracy compared to
unsupervised learning, without the time and expense of supervised learning. In a semi-supervised learning
process, a general rule is first learned from the labeled data and then applied to infer labels for the
unlabeled data. Machine learning algorithms enhanced in this way have been applied to intrusion detection [7].

The primary goal of this work is to find an appropriate method for classifying DDoS attacks by making
use of semi-supervised learning on a global DDoS dataset, and to find the most effective centroids for use in
attack classification. The benefits of our proposed algorithm over earlier detection solutions based on
supervised and unsupervised learning are as follows: i) fewer labeled samples are needed to train detection
models with our proposed method than with supervised learning detection algorithms, ii) a hybrid feature
selection method is proposed that uses both the low variance filter and the information gain ratio
techniques, iii) DDoS and normal centroids are presented to assist in implementing them online for traffic
classification. The remainder of this paper is organized as follows. Related work on DDoS attack detection
and its limitations is introduced below. Our detection model, built on a semi-supervised clustering
algorithm, is presented in section 2. The results and analyses of the experiments are then discussed, and
the paper concludes with recommendations for further research. A variety of methods have been proposed
for detecting DDoS attacks, such as [8]-[10]. Techniques based on machine learning appear most frequently
in published research. Table 1 (in Appendix) provides a brief overview of some recent research and
developments in DDoS detection.


2. THE PROPOSED METHOD 

This section first describes the dataset used in this study. Then, the proposed method for intrusion
detection and the proposed centroids clustering are presented, as shown in Figure 1. Finally, the
results are analyzed and discussed.


2.1. Description of the dataset 

Sharafaldin et al. [21] proposed CICIDS2017 to address the shortage of IDS datasets that satisfy
the criteria of real-world network traffic [22]. CICIDS2017 is a valid and widely used dataset
[23], and it is described as the largest and most used dataset [24]. In the current work, 20% of the
CICIDS2017 dataset is used to train the machine learning algorithm. The dataset includes 84 features and
contains both benign traffic and attack traffic. It is large and has a high class imbalance.


2.2. K-Means clustering algorithm

K-Means is a vector quantization technique that tries to partition n observations into k
clusters, where every observation belongs to the cluster with the nearest mean (also known as the
cluster centroid or cluster center), which acts as the cluster's prototype [25]. Both hierarchical
clustering and K-Means frequently use the canopy method as a preprocessing step [26]. Its purpose is to
increase the speed at which clustering operations are performed on large data sets, where
it may be impractical to use another algorithm directly due to the volume of the dataset.
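
The assignment of observations to the nearest mean can be illustrated with a minimal NumPy sketch of the standard Lloyd iteration. This is only an illustration under stated assumptions: the experiments in this paper were run in WEKA, and the function name `kmeans`, the random initialisation, and the toy data below are hypothetical and not taken from the paper.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct observations chosen at random
    # (an assumption; the paper does not state its initialisation).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each observation goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Toy data: two well-separated groups of 2-D points (hypothetical values).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [9.0, 9.1], [9.2, 9.0], [9.1, 9.2]])
centroids, labels = kmeans(X, k=2)
```

On data this well separated, the two returned centroids are the means of the two groups; on real traffic features the result depends on the initialisation and on feature scaling.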


2.3. Feature selection methods 

One of the most common problems researchers encounter is choosing which features are most important
and thus relevant for detecting attacks. Feature selection is critical because it affects how well the system
works. Too few features may lead to subpar detection accuracy, while too many may lead to excellent detection
accuracy at the expense of an overly complex system that consumes more resources. This work employed two
feature selection techniques; Figure 1 shows the main diagram of the proposed framework.


2.3.1. Variance filter feature selection technique 

The low variance filter method [27] was used to choose the features used in this paper, since
all of the attributes are numeric. The method excludes features with low variance, which contribute
little or nothing to the model's overall performance. The variance of each feature is computed as in (1).

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \qquad (1)$$

where $\mu$ is the average of all the values associated with the attribute, $X_i$ denotes the attribute
values taken from the data, and $N$ is the total number of samples.
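
As an illustration of the low variance filter, the following NumPy sketch computes the per-feature variance (dividing by N, as assumed here) and keeps only features at or above a threshold. The function name, the toy matrix, and the feature names are hypothetical; the threshold of 3 is the value reported in the framework description of this paper.

```python
import numpy as np


def low_variance_filter(X, names, threshold):
    """Keep features whose population variance is at least `threshold`."""
    mu = X.mean(axis=0)                                   # per-feature mean
    variances = ((X - mu) ** 2).sum(axis=0) / X.shape[0]  # divide by N
    keep = variances >= threshold
    kept_names = [n for n, k in zip(names, keep) if k]
    return X[:, keep], kept_names, variances


# Hypothetical toy matrix: rows are flows, columns are features.
X = np.array([[1.0, 10.0, 0.0],
              [1.0, 20.0, 0.0],
              [1.0, 30.0, 4.0],
              [1.0, 40.0, 4.0]])
names = ["const_flag", "fwd_bytes", "bwd_bytes"]
X_kept, kept_names, variances = low_variance_filter(X, names, threshold=3.0)
```

Here the constant column has variance 0 and is dropped, while the other two columns (variances 125 and 4) are kept. Because variance depends on the scale of each feature, a single threshold treats features with large raw ranges very differently from features with small ranges.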


2.3.2. Information gain 
Due to its usefulness and importance in detecting a class type, the information gain ratio (IGR) [28] is
also employed as a weight for attributes in this work (2).

$$\mathrm{IGR}(Y, A_j) = \frac{H(Y) - H(Y \mid A_j)}{H(A_j)} \qquad (2)$$

where $Y$ represents the class and $A_j$ the $j$-th attribute. The entropy function, $H(\cdot)$, is defined as follows:

$$H(X) = -\sum_{i} p(x_i) \log p(x_i) \qquad (3)$$

where $p(\cdot)$ represents the probability operator and $i$ represents an index over the probabilities.
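
A small worked example may help make the attribute weighting concrete. The sketch below assumes the standard gain-ratio definition, IGR(Y, A) = (H(Y) - H(Y | A)) / H(A), with Shannon entropy in bits (the base of the logarithm is an assumption) and attributes that have already been discretised. It is not WEKA's implementation, and all names and data are hypothetical.

```python
import numpy as np


def entropy(values):
    """Shannon entropy (in bits) of a discrete sample."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def conditional_entropy(y, a):
    """H(Y | A): entropy of y within each value of a, weighted by frequency."""
    total = 0.0
    for v in np.unique(a):
        mask = a == v
        total += mask.mean() * entropy(y[mask])
    return total


def info_gain_ratio(y, a):
    """(H(Y) - H(Y | A)) / H(A); defined as 0 when A is constant."""
    h_a = entropy(a)
    if h_a == 0.0:
        return 0.0
    return (entropy(y) - conditional_entropy(y, a)) / h_a


# Hypothetical discretised attribute values and class labels.
y = np.array(["ddos", "ddos", "normal", "normal"])
a_informative = np.array(["high", "high", "low", "low"])
a_useless = np.array(["x", "y", "x", "y"])
igr_informative = info_gain_ratio(y, a_informative)
igr_useless = info_gain_ratio(y, a_useless)
```

An attribute that separates the classes perfectly receives a weight of 1, and one that carries no information about the class receives 0. Continuous traffic features would need to be discretised before this computation.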


2.4. Proposed centroids clustering 

The proposed method uses semi-supervised K-Means clustering to generate multiple centroids
that can be used to classify traffic as either safe or malicious. Starting with the selected CICIDS2017 dataset,
we use the K-Means algorithm to produce semi-supervised centroids for detecting DDoS attacks. The idea of
semi-supervised learning is to use a small amount of labeled data to label larger data sets.
Figure 1 shows the semi-supervised framework diagram.

Figure 1. Proposed framework

The main processes in the proposed framework are as follows:

a. Select features using the hybrid feature selection algorithms. In this work, the variance
scores and the information gain were used to find the best list of features. Variance is applied to
exclude uninformative features with a variance score less than 3. In addition, features with an information
gain weight below 0.6 are discarded. This yields the 15 selected features listed in
Table 2. Note that the variance values over all the data range from 0 (Bwd PSH Flags) to 9.99E+14
(Fwd IAT Total).

b. Utilize the K-Means algorithm to generate the appropriate centroids. 20% of the CICIDS2017 dataset was
used to train the proposed method to generate centroids, and the remaining 80% of the dataset was used to
test the generated centroids.

c. Compare the results using the accuracy scores and select the best result.
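
The way generated centroids can be turned into a classifier is sketched below. The paper does not spell out how each centroid receives its DDoS or normal label, so the majority vote over a small labeled subset used here is an assumption, as are the function names and the toy two-feature data.

```python
import numpy as np


def nearest_centroid(X, centroids):
    """Index of the closest centroid (Euclidean distance) for each row of X."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)


def label_centroids(centroids, X_labeled, y_labeled, default="unknown"):
    """Name each centroid by majority vote of the labeled flows nearest to it."""
    assigned = nearest_centroid(X_labeled, centroids)
    names = []
    for j in range(len(centroids)):
        members = y_labeled[assigned == j]
        if members.size == 0:
            names.append(default)      # no labeled flow fell in this cluster
            continue
        values, counts = np.unique(members, return_counts=True)
        names.append(str(values[counts.argmax()]))
    return names


# Hypothetical centroids, e.g. produced by K-Means on the training split.
centroids = np.array([[300.0, 12000.0], [30.0, 0.0]])
# A small labeled subset used only to name the clusters.
X_labeled = np.array([[280.0, 11000.0], [25.0, 10.0], [310.0, 12500.0]])
y_labeled = np.array(["DDoS", "Normal", "DDoS"])
centroid_names = label_centroids(centroids, X_labeled, y_labeled)

# Held-out flows are classified by the name of their nearest centroid.
X_test = np.array([[290.0, 11800.0], [40.0, 5.0]])
y_pred = [centroid_names[j] for j in nearest_centroid(X_test, centroids)]
```

Once the centroids are named, classifying a new flow only requires one distance computation per centroid, which is what makes fixed centroids attractive for online use.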


Table 2. Feature scores using information gain

Feat. No.  Feat. Name                  Feat. Score
1          SubflowFwdBytes             0.939343
2          TotalLength of FwdPackets   0.939343
3          AveragePacketSize           0.80995
4          TotalLength of BwdPackets   0.782456
5          SubflowBwdBytes             0.782456
6          BwdPacketLengthMean         0.781841
7          AvgBwdSegmentSize           0.781841
8          FwdHeaderLength             0.778016
9          DestinationPort             0.77582
10         BwdPacketLengthMax          0.760317
11         Init_Win_bytes_forward      0.708411
12         AvgFwdSegmentSize           0.706064
13         FwdPacketLengthMean         0.706064
14         FwdPacketLengthMax          0.701009
15         BwdHeaderLength             0.682524


3. RESULTS AND EVALUATION 

The detection performance of the semi-supervised K-Means algorithm was measured in this experiment.
Clustering and feature selection by information gain were performed with WEKA. Accuracy measures the
algorithm's ability to detect attacks in both benign and attack traffic. The accuracy is computed according to (4).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false
negatives, respectively.
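
The accuracy computation can be checked on a small example. The sketch below counts the four confusion-matrix cells from true and predicted labels and applies the accuracy formula; treating DDoS as the positive class and the example labels themselves are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred, positive="DDoS"):
    """Return (TP, TN, FP, FN), treating `positive` as the attack class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for zero samples")
    return (tp + tn) / total


# Hypothetical labels for five flows.
y_true = ["DDoS", "DDoS", "Normal", "Normal", "Normal"]
y_pred = ["DDoS", "Normal", "Normal", "Normal", "DDoS"]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
```

In this example three of the five flows are classified correctly (one true positive and two true negatives), so the accuracy is 0.6. With a highly imbalanced dataset such as CICIDS2017, accuracy alone can hide poor detection of the minority class.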


Accuracy also reflects the performance of the detection engine: it indicates the machine's ability
to predict traffic according to its actual condition, that is, the capacity of the
machine to classify each class correctly. Figure 2 and Table 3 present the values of the centroids generated
by the proposed method, which provides two optimum centroids for classifying traffic as normal or DDoS
attack. Table 4 displays the accuracy of K-Means. The results in Table 4 show that Test 1 was the best
choice: it achieved the highest accuracy with 2 centroids, one labeled normal and the other DDoS.
Figure 3 presents a performance comparison between the proposed K-Means and Canopy.

Figure 2. Distributed traffic of proposed centroids


Table 3. Values of generated centroids

No.  Feat. Name                  Centroid 1 (DDoS)  Centroid 2 (Normal)
1    DestinationPort             80                 80
2    TotalLength of FwdPackets   288                30
3    TotalLength of BwdPackets   11724              0
4    FwdPacketLengthMax          288                6
5    FwdPacketLengthMean         13.714286          6
6    BwdPacketLengthMax          5792               0
7    BwdPacketLengthMean         732.75             0
8    FwdHeaderLength             680                100
9    BwdHeaderLength             520                0
10   AveragePacketSize           324.648649         72
11   AvgFwdSegmentSize           13.714286          6
12   AvgBwdSegmentSize           732.75             0
13   SubflowFwdBytes             288                30
14   SubflowBwdBytes             1724               0
15   Init_Win_bytes_forward      29200              256

Table 4. Accuracy of K-Means and canopy algorithms

Test No.               K-Means Accuracy (%)  Canopy Accuracy (%)
Test 1 (2 centroids)   79.60                 72.30
Test 2 (4 centroids)   68.90                 65.70
Test 3 (6 centroids)   42.10                 55.90

Figure 3. Performance of proposed method and canopy algorithm


4. CONCLUSION AND FUTURE WORK 

This paper presents an algorithm to classify DDoS attacks using a semi-supervised machine learning
method. It starts with unlabeled traffic statistics gathered from three parts of the victim-end
defense, which is the web server. The proposed hybrid feature selection techniques reduce the dataset
from 84 features to 15, which are used for the final labeling of traffic flows in the proposed framework. The
K-Means clustering algorithm groups the unlabeled data. The scheme used a representative part of the
benchmark CICIDS2017 dataset with new normal and attack centroids to test how well labels were assigned. In
the future, we want to find better voting-based ways to label traffic online, add more ML algorithms to the
clustering and classification processes, and put the proposed centroids into the online detection framework.

APPENDIX

Table 1. Recent related work

No | References | Technique Name | Results Discussion
1 | [11] | Fuzzy c-means clustering | The research is based on network traffic characteristics retrieved from the network that might indicate the presence of DDoS botnets in the network. According to the findings of the experiments, the detection rate is around 95%, with only 6% of false positives.
2 | [12] | Co-clustering, Information Gain Ratio, and the Extra-Trees technique and estimating entropy | The entropy estimator examines the entropy of network traffic data over a sliding time-based frame. Co-clustering divides incoming network traffic into three groups when entropy exceeds thresholds. The information gain ratio (IGR) is calculated using the average network header entropy between each cluster and the current time frame subset. Extra-Trees ensemble classifiers are used for preprocessing and classification of high-gain anomalous network traffic data clusters.
3 | [13] | Clustering Using Representative (CURE), Entropy | The intrusion detection method described in this article combines several unsupervised data mining techniques. Entropy theory in terms of packet windowing and data mining are integrated to identify the DDoS attack in network flow. Clustering using representative (CURE) is used for the cluster analysis.
4 | [14] | Random Forest, Bagging, and AdaboostM1 | This study proposes a semi-supervised multi-layered clustering (SMLC) model for detecting and preventing network intrusion. SMLC may learn from partially labeled data and achieve detection performance comparable to IDPS based on supervised machine learning. The performance of SMLC on two benchmark network-intrusion datasets, NSL and Kyoto 2006, is compared to that of a well-known semi-supervised approach (tri-training) and of supervised ensemble ML models, particularly Random Forest, Bagging, and Adaboost.
5 | [15] | K-Means algorithm and Hybrid Feature Selection | This study proposes an enhanced density-based initial cluster centers selection method after a Hadoop-based hybrid feature selection technique to find the most useful feature sets, in order to address the problem of outliers and local optimums.
6 | [16] | Verification approach | The researchers present a new semi-supervised intrusion detection model that utilizes a verification strategy to produce consistent classifications across time, even when model updates are not available. Semi-supervised learning is used to update the underlying machine learning models without the requirement for human interaction. The pool verifier, depending on the conclusion of the pool of classifiers, uses the classifications recognized by the verifier to determine whether it is reliable or not.
7 | [17] | K-Nearest Neighbor (K-NN) and Artificial Neural Network (ANN) | This study proposes FloodDetector, an effective architecture for detecting known and unknown flooding assaults in SDN. It is a controller-agnostic SDN application that employs two machine learning classifiers to detect both known and unknown flooding attacks: K-nearest neighbor (K-NN) and artificial neural network (ANN).
8 | [18] | Deep neural networks | To detect intelligent systems, this study proposes the use of machine learning frameworks. The study uses deep learning to distinguish between benign data exchange and harmful data traffic attacks.
9 | [19] | The N-Gram line generation, feature selection algorithm, and SVM algorithm | This paper offers a network traffic flow-based approach for mobile malware detection that treats each HTTP flow as a document and analyzes HTTP flow requests using natural language processing string analysis. An effective malware detection model is created using the N-Gram line generation, feature selection method, and SVM algorithm.
10 | [20] | DBSCAN, SVM, and Random Forest | In this paper, a hybrid supervised/unsupervised strategy is proposed. First, the clustering algorithm separates the anomalous traffic from the regular data by using numerous flow-based criteria. After determining the statistical characteristics each cluster shares, they can be assigned names using a categorization method. The authors conduct an evaluation of the proposed method by processing vast amounts of data.
# | Our Proposed | K-Means algorithm | 1- Training and testing with CICIDS2017 dataset. 2- Proposed hybrid feature selection techniques. 3- Produce DDoS and Normal centroids. 4- Evaluation.